Background

The Tropical Cyclones Developmental Dataset was used to develop the Statistical Hurricane Intensity Prediction Scheme (SHIPS) for predicting changes in tropical cyclone (TC) intensity (DeMaria and Kaplan 1994). The National Hurricane Center (NHC) uses SHIPS, along with other models, to generate predictions and guide official track and intensity forecasts (“NHC Track and Intensity Models” 2009). SHIPS forecasts have outperformed climatology and persistence forecasts since circa 1997. However, SHIPS has not performed well in Rapid Intensification (RI) events, defined as an increase in maximum wind speed over 24 hours exceeding 30, 35, or 40 knots, as described in Kaplan and DeMaria (2003).

Data Description

We focus on TCs from the Atlantic Basin only. The raw data arrive in a file called lsdiaga_1982_2014_rean_sat_nbc_ts.dat, which is described as Atlantic data with SHIPS predictors from either re-analysis or operational analysis, with satellite variables when available. The file is one concatenated list of record sets, one for each storm case. Here, a case includes a current time observation (hour 0) in addition to 120 hours of forecast information (hours 6 to 120) and, in some cases, 12 hours of past information (hours -12 to -6). The case time points occur at 6 hour intervals. Each case begins with a descriptor line called HEAD and ends with an end line called LAST. Not all predictors are available for all years.

To put the .dat file in a readable format, we treated each record as a fixed width format (fwf) table, skipping the header rows, with 24 columns of width 5. Once the fwf table was parsed, it was transposed to place the attributes as columns and the time points as rows. The missing value string 9999 was replaced with NA, and the parsed, transposed, and cleaned record was written to a .csv file for the sake of creating a bank of easily accessible records. As there are multiple records for one TC, the .csv files are named uniquestormID.recordnumber.csv.
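The parsing steps above can be sketched as follows. The report does not show its code, so this is only an illustration in Python using the standard library. It assumes that the first 23 fields of each line hold the values for hours -12 to 120 and that the last field holds the attribute name; the helper names (split_fixed_width, to_value, parse_case, format_line) and the synthetic demo lines are hypothetical.

```python
# Sketch only. The layout constants restate the report's description
# (24 fields of width 5). Treating the last field as the attribute
# name is an assumption, not something the report confirms.
WIDTH, N_COLS, MISSING = 5, 24, 9999
HOURS = list(range(-12, 121, 6))  # 23 time points: -12, -6, 0, ..., 120


def split_fixed_width(line, width=WIDTH, n_cols=N_COLS):
    """Cut one line into n_cols fields of equal width, stripping padding."""
    padded = line.rstrip("\n").ljust(width * n_cols)
    return [padded[i * width:(i + 1) * width].strip() for i in range(n_cols)]


def to_value(field):
    """Convert a field to int; blanks, 9999, and non-numeric text become None."""
    try:
        number = int(field)
    except ValueError:
        return None
    return None if number == MISSING else number


def parse_case(lines):
    """Parse the lines of one case into rows, one per time point."""
    columns = {}
    for line in lines:
        fields = split_fixed_width(line)
        name = fields[-1]
        if not name or name in ("HEAD", "LAST"):
            continue  # skip the case header, the end marker, and blank lines
        columns[name] = [to_value(f) for f in fields[:-1]]
    # Transpose: attributes become keys (columns), time points become rows.
    return [
        {"HOUR": hour, **{name: values[i] for name, values in columns.items()}}
        for i, hour in enumerate(HOURS)
    ]


def format_line(values, name):
    """Build a synthetic fixed-width line for the demonstration below."""
    return "".join(f"{v:5d}" for v in values) + f"{name:>5}"


# Synthetic case: two missing past hours, then made-up VMAX values.
vmax = [9999, 9999] + [20 + h for h in range(21)]
demo = ["".ljust(115) + " HEAD", format_line(vmax, "VMAX"), "".ljust(115) + " LAST"]
rows = parse_case(demo)
print(rows[0]["VMAX"], rows[2]["VMAX"])
```

Each returned row could then be written out with csv.DictWriter to produce one .csv file per record.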

The next step in preprocessing the data was to concatenate the hour 0 observations for each case for each storm. Before adding an observation to a master data frame, we cross-checked observations of corresponding time points. If the cases are recorded correctly, then for a given storm, hours 0 to 114 of the current case should equal hours 6 to 120 of the previous case for time dependent predictors. To keep track of case discrepancies, we appended a column called Match to the master data frame. Below is a detailed description of each of the raw attributes of the master data frame, mostly adapted from the predictor description file.

Attribute Name Description
ID chr. Storm identifier. The first two characters “AL” represent the Atlantic basin, the second two characters represent the sequence number of a TC in a certain year, and the remaining four characters represent the year when the TC happened
DATE POSIXct. Date of the storm in the format yymmdd
TIMESTAMP POSIXct. Date of the storm in the format yymmdd HH
RECORDNM num. The time intervals are currently being fixed. This is the record number from the original .dat file
VMAX int. Maximum surface wind (kts)
MSLP int. Minimum sea level pressure (hPa)
TYPE factor. Storm type (0 = wave, remnant low, or dissipating low; 1 = tropical; 2 = subtropical; NA = extra-tropical).
HIST20-HIST120 int. Number of 6 hour periods that the storm max wind has been above 20, 25, … 120 kts.
DELV int. Intensity change relative to the storm start (kts)
INCV int. Intensity change relative to the previous 6 hour interval. Set to NA for land mass crossings
LAT int. The latitude in 10*degrees North of the approximate storm center
LON int. The longitude in 10*degrees West of the approximate storm center
CSST int. Climatological sea surface temperature (deg C*10)
CD20 int. Climatological depth (m) of 20 degree isotherm from 2005-2010 NCODA analyses
CD26 int. Climatological depth (m) of 26 degree isotherm from 2005-2010 NCODA analyses
COHC int. Climatological ocean heat content (kJ/cm^2) from 2005-2010 NCODA analyses
DTL int. Distance to nearest major land mass (km)
RSST int. Reynolds sea surface temperature (deg C*10)
PCHN int. Estimated ocean heat content (kJ/cm^2) from COHC and current sea surface temperature anomaly. Designed to fill in missing RHCN
U200 int. 200 hPa zonal wind speed (10*kts) for r = 200-800 km on average
U20C int. 200 hPa zonal wind speed (10*kts) for r = 0-500 km on average
V20C int. Meridional (v) component of the 200 hPa wind (10*kts) for r = 200-800 km on average
E000 int. 1000 hPa equivalent potential temperature, \(\theta_e\) for r = 200-800 km on average (K)
EPOS int. Average \(\theta_e\) difference between a parcel lifted from the surface and its environment for r = 200-800 km on average (K)
ENEG int. Negative average \(\theta_e\) difference between a parcel lifted from the surface and its environment for r = 200-800 km on average, sign not included (K)
EPSS int. Average \(\theta_e\) difference between a parcel lifted from the surface and its environment for r = 200-800 km on average, with the parcel \(\theta_e\) compared with the saturated \(\theta_e\) of the environment (K)
ENSS int. Negative average \(\theta_e\) difference between a parcel lifted from the surface and its environment for r = 200-800 km on average, sign not included with \(\theta_e\) compared with the saturated \(\theta_e\) of the environment (K)
RHLO int. 850-700 hPa relative humidity (%) for 200-800 km
RHMD int. 700-500 hPa relative humidity (%) for 200-800 km
RHHI int. 500-300 hPa relative humidity (%) for 200-800 km
PSLV int. Pressure of the center of mass of the layer where the storm motion best matches environmental flow. Used to calculate steering pressure as well (hPa)
Z850 int. 850 hPa vorticity (\(sec^{-1}*10^7\)) for r = 0-1000 km
D200 int. 200 hPa divergence (\(sec^{-1}*10^7\)) for r = 0-1000 km
REFC int. Relative eddy momentum flux convergence (\(m/s/day\)) for r = 100-600 km on average
PEFC int. Planetary eddy momentum flux convergence (\(m/s/day\)) for r = 100-600 km on average
T000 int. 1000 hPa temperature (deg C*10) 200-800 km average
R000 int. 1000 hPa relative humidity 200-800 km average
Z000 int. 1000 hPa height deviation (m) from the U.S. standard atmosphere
TLAT int. Latitude of 850 hPa vortex center in NCEP analysis (10*deg N)
TLON int. Longitude of 850 hPa vortex center in NCEP analysis (10*deg W)
TWAC int. Symmetric tangential wind at 850 hPa from NCEP analysis 0-600 km average (\(m/sec*10\))
TWXC int. Maximum symmetric tangential wind at 850 hPa from NCEP analysis (\(m/sec*10\))
G150 int. Temperature perturbation at 150 hPa due to the symmetric vortex calculated from gradient thermal wind. Averaged from r=200 to 800 km centered on input lat and lon (not always the model/analysis vortex position) (deg C*10)
G200 int. Temperature perturbation at 200 hPa due to the symmetric vortex calculated from gradient thermal wind. Averaged from r=200 to 800 km centered on input lat and lon (not always the model/analysis vortex position) (deg C*10)
G250 int. Temperature perturbation at 250 hPa due to the symmetric vortex calculated from gradient thermal wind. Averaged from r=200 to 800 km centered on input lat and lon (not always the model/analysis vortex position) (deg C*10)
V000 int. Tangential wind azimuthally averaged at r=500 km from (TLAT,TLON). If TLAT,TLON are not available, (LAT,LON) are used (\(m/sec*10\))
V850 int. 850 hPa tangential wind azimuthally averaged at r=500 km from (TLAT,TLON). If TLAT,TLON are not available, (LAT,LON) are used (\(m/sec*10\))
V500 int. 500 hPa tangential wind azimuthally averaged at r=500 km from (TLAT,TLON). If TLAT,TLON are not available, (LAT,LON) are used (\(m/sec*10\))
V300 int. 300 hPa tangential wind azimuthally averaged at r=500 km from (TLAT,TLON). If TLAT,TLON are not available, (LAT,LON) are used (\(m/sec*10\))
TGRD int. Magnitude of the temperature gradient between 850 and 700 hPa averaged from 0 to 500 km estimated from the geostrophic thermal wind (\(degC/m*10^7\))
TADV int. The temperature advection between 850 and 700 hPa averaged from 0 to 500 km from the geostrophic thermal wind (\(degC/sec*10^6\))
PENC int. Azimuthally averaged surface pressure at outer edge of vortex \(( (hPa-1000)*10)\)
SHDC int. Shear magnitude (kts*10) vs time (200-800 km) with vortex removed and averaged from 0-500 km relative to 850 hPa vortex center
SDDC int. Heading in degrees of above shear vector where westerly shear is valued at 90 degrees
SHGC int. Generalized 850–200 hPa shear magnitude (kts*10) (takes into account all levels) with vortex removed and averaged from 0-500 km relative to 850 hPa vortex center
DIVC int. Divergence (\(sec^{-1}*10^7\)) for r = 0-1000 km centered at 850 hPa vortex location
T150 int. 150 hPa temperature (deg C*10) versus time 200 to 800 km
T200 int. 200 hPa temperature (deg C*10) versus time 200 to 800 km
T250 int. 250 hPa temperature (deg C*10) versus time 200 to 800 km
SHRD int. 850-200 hPa shear magnitude (kts*10) vs time 200-800 km
SHRS int. 850-500 hPa shear magnitude (kts*10)
SHTS int. Heading of above shear vector (deg)
SHRG int. Generalized 850-200 hPa shear magnitude (kts*10) (takes into account all levels)
PENV int. 200 to 800 km average surface pressure \(((hPa-1000)*10)\)
VMPI int. Maximum potential intensity from Kerry Emanuel equation (kts)
VVAV int. Average (0 to 15 km) vertical velocity (\(m/s *100\)) of a parcel lifted from the surface where entrainment, the ice phase and the condensate weight are accounted for. Source note: Moisture and temperature biases between the operational and reanalysis files make this variable inconsistent in the 2001-2007 samples, compared to 2000 and before
VMFX int. VVAV with a density weighted vertical average
VVAC int. VVAV with soundings from 0-500 km with GFS vortex removed
HE07 undefined
HE05 undefined
IRXX int. Non-satellite GOES model predictors used to generate IR00
RD20 int. Ocean depth of the 20 deg C isotherm (m), from satellite altimetry data
RD26 int. Ocean depth of the 26 deg C isotherm (m) from satellite altimetry data
RHCN int. Ocean heat content (kJ/cm^2) from satellite altimetry data
Match logical. Crosscheck for a given storm, where hours 0 to 114 of the current case should be equivalent with hours 6 to 120 of the previous case for time dependent predictors
IR00_AVG_200BT int. Average GOES 4 satellite brightness temp r=0-200 km (deg C *10)
IR00_STD_200BT int. Standard deviation of GOES 4 satellite brightness temp r=0-200 km (deg C *10)
IR00_AVG_300BT int. Average GOES 4 satellite brightness temp r=100-300 km (deg C *10)
IR00_STD_300BT int. Standard deviation GOES 4 satellite brightness temp r=100-300 km (deg C *10)
IR00_PCT_AREA_10BT int. Percent area r= 50-200 km of GOES 4 brightness temp \(<\) -10 C
IR00_PCT_AREA_20BT int. Percent area r= 50-200 km of GOES 4 brightness temp \(<\) -20 C
IR00_PCT_AREA_30BT int. Percent area r= 50-200 km of GOES 4 brightness temp \(<\) -30 C
IR00_PCT_AREA_40BT int. Percent area r= 50-200 km of GOES 4 brightness temp \(<\) -40 C
IR00_PCT_AREA_50BT int. Percent area r= 50-200 km of GOES 4 brightness temp \(<\) -50 C
IR00_PCT_AREA_60BT int. Percent area r= 50-200 km of GOES 4 brightness temp \(<\) -60 C
IR00_MAX_BT int. Maximum brightness temp r = 0-30 km (deg C *10)
IR00_AVG_30BT int. Average brightness temp r = 0-30 km (deg C *10)
IR00_RADIUS_MAXBT int. Radius of maximum brightness temp (km)
IR00_MIN_20BT int. Minimum brightness temp r = 20-120 km (deg C *10)
IR00_AVG_20BT int. Average brightness temp r = 20-120 km (deg C *10)
IR00_RADIUS_MINBT int. Radius of minimum brightness temp (km)
IR3* int. Same as the IR00 variables, but three hours before the initial case time
JDATE int. The absolute value of the Julian date minus the peak date of the season according to DeMaria and Kaplan (1994)
POT int. The intensification potential, that is VMPI - VMAX (kts)
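The Match attribute in the table above comes from comparing consecutive cases of one storm. A minimal sketch of that comparison, in Python, is given here under stated assumptions: each case is held as a mapping from attribute name to a list of 23 values aligned with hours -12 to 120, and consecutive cases are exactly 6 hours apart. The function name cross_check and its skip parameter are hypothetical, and the demo values are made up.

```python
HOURS = list(range(-12, 121, 6))  # time points of one case, in hours


def cross_check(previous, current, skip=()):
    """Compare two consecutive cases of the same storm.

    Each case maps an attribute name to a list of values aligned with HOURS.
    Returns the (attribute, hour) pairs where the current case at `hour`
    disagrees with the previous case at `hour + 6`.
    """
    mismatches = []
    for name, cur_values in current.items():
        if name in skip or name not in previous:
            continue
        prev_values = previous[name]
        for hour in range(0, 115, 6):  # hours 0 to 114 of the current case
            cur = cur_values[HOURS.index(hour)]
            prev = prev_values[HOURS.index(hour + 6)]
            if cur != prev:  # two missing values (None) count as equal
                mismatches.append((name, hour))
    return mismatches


# Made-up values: the current case is the previous case shifted by one step.
previous_case = {"VMAX": list(range(23))}
current_case = {"VMAX": [v + 1 for v in previous_case["VMAX"]]}
consistent = cross_check(previous_case, current_case)

# Corrupt hour 12 of the current case to force a single mismatch.
broken_case = {"VMAX": list(current_case["VMAX"])}
broken_case["VMAX"][HOURS.index(12)] = -1
inconsistent = cross_check(previous_case, broken_case)

match_flag = not inconsistent  # what a Match column could store
print(consistent, inconsistent, match_flag)
```

Attributes that are not time dependent can be passed through skip so that they do not affect the flag.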

Exploratory Analysis

Without any further cleaning, the data comprise 10705 rows of real-time instances and 135 columns of attributes for each instance. There are 460 unique TCs ranging from 1982-06-02 to 2014-10-28. The storm paths are limited to the Atlantic Basin, as seen below.

Detailed Variable Exploration

Checks and Inconsistencies

The cross check for time point inconsistencies finds a total of 38 inconsistencies. As documented in Jankulak (2012), 18 of these belong to the REFC attribute. Another 10 are due to the TYPE attribute. The remaining 10 belong to 9 different storms, each of which is missing at least one day of data. For example, Hurricane Nadine, the longest storm in this set of records, has missing cases between 120921 and 120923. We can see the missing data in the lat/long path below.

Next, we check that the storm history variables beginning with HIST are in fact a cumulative history. For the first TC, the HIST variables look like:

A little messier, all the storms look like:

Checking numerically, row by row within each storm, that each history variable differs from its previous observation by either 0 or 1, we find a total of 16 storms with history issues, including the 9 storms with time point inconsistencies. The remaining 7 storm IDs,

## [1] "AL042013" "AL052011" "AL072012" "AL091997" "AL092014" "AL132005"
## [7] "AL282005"

show history inconsistencies which, on further investigation, are a result of date inconsistencies. As it happens, this data set only includes cases that qualify as tropical or sub-tropical storms. If a storm weakens, it is no longer classified into either of these two categories and its data are no longer available. If the storm strengthens again, the tracking resumes, which explains what happened to NADINE and the storms with inconsistent histories. As a small note, there are also inconsistencies between storms in the case-dependent TYPE and REFC attributes. With these variables excluded from the checks and the missing dates accounted for, all inconsistencies are resolved.
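The row-by-row history check described above can be sketched in Python as follows. This is an illustration only, assuming that the master data are available as a list of dictionaries with ID, TIMESTAMP, and HIST keys; the function name storms_with_history_issues and the demo rows (including the storm IDs STORM_A and STORM_B) are made up.

```python
from collections import defaultdict


def storms_with_history_issues(rows, hist_columns):
    """Return sorted IDs of storms whose HIST values are not cumulative.

    Within a storm, each HIST column should differ from its value in the
    previous row by either 0 or 1. Missing values (None) are skipped.
    """
    by_storm = defaultdict(list)
    for row in rows:
        by_storm[row["ID"]].append(row)

    flagged = set()
    for storm_id, storm_rows in by_storm.items():
        storm_rows.sort(key=lambda r: r["TIMESTAMP"])
        for prev, cur in zip(storm_rows, storm_rows[1:]):
            for col in hist_columns:
                before, after = prev[col], cur[col]
                if before is None or after is None:
                    continue
                if after - before not in (0, 1):
                    flagged.add(storm_id)
    return sorted(flagged)


# Made-up rows: STORM_B jumps from 1 to 4, as a gap in dates would cause.
demo_rows = [
    {"ID": "STORM_A", "TIMESTAMP": 0, "HIST20": 0},
    {"ID": "STORM_A", "TIMESTAMP": 6, "HIST20": 1},
    {"ID": "STORM_A", "TIMESTAMP": 12, "HIST20": 2},
    {"ID": "STORM_B", "TIMESTAMP": 0, "HIST20": 0},
    {"ID": "STORM_B", "TIMESTAMP": 6, "HIST20": 1},
    {"ID": "STORM_B", "TIMESTAMP": 12, "HIST20": 4},
]
flagged_storms = storms_with_history_issues(demo_rows, ["HIST20"])
print(flagged_storms)
```

A jump larger than 1 is the pattern one would expect when cases are missing between two consecutive rows of a storm.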

VMAX and TIME

Each storm instance has a binary label, with 1 indicating RI (any increase in wind speed greater than 30 kts within 24 hrs) and 0 indicating nonRI, in addition to a VMAX_Diff column that measures the future 24 hour change in intensity relative to the current time point. An NA in the RI and VMAX_Diff columns signifies that the data are not available for the time points needed to calculate these attributes (storms like NADINE). The distributions of maximum intensity per time interval for the nonRI and RI storms are shown below. Keep in mind that NAs are likely equivalent to nonRI events, as the storm has weakened.
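One way the label and the 24 hour difference could be derived is sketched below in Python. It is not the report's code: the function name label_rapid_intensification, its parameters, and the demo rows are hypothetical. The strict comparison follows the wording "greater than 30 kts" used here; whether the threshold itself should count as RI is an assumption to adjust as needed.

```python
from datetime import datetime, timedelta


def label_rapid_intensification(rows, threshold=30, horizon_hours=24):
    """Add VMAX_Diff and RI to each row (dicts with ID, TIMESTAMP, VMAX).

    VMAX_Diff is VMAX at TIMESTAMP + horizon_hours minus the current VMAX
    for the same storm. Both fields are None when either value is missing.
    """
    vmax_at = {(r["ID"], r["TIMESTAMP"]): r["VMAX"] for r in rows}
    labeled = []
    for r in rows:
        later = r["TIMESTAMP"] + timedelta(hours=horizon_hours)
        future = vmax_at.get((r["ID"], later))
        if future is None or r["VMAX"] is None:
            diff, ri = None, None
        else:
            diff = future - r["VMAX"]
            ri = 1 if diff > threshold else 0  # strict ">" is an assumption
        labeled.append({**r, "VMAX_Diff": diff, "RI": ri})
    return labeled


# Made-up storm with six observations at 6 hour intervals.
start = datetime(2000, 1, 1, 0)
vmax_values = [30, 35, 45, 60, 65, 60]
demo_rows = [
    {"ID": "STORM_A", "TIMESTAMP": start + timedelta(hours=6 * i), "VMAX": v}
    for i, v in enumerate(vmax_values)
]
labeled = label_rapid_intensification(demo_rows)
print([(r["VMAX_Diff"], r["RI"]) for r in labeled])
```

Rows within the last 24 hours of a storm, or rows followed by a gap, receive None for both fields, which corresponds to the NA values described above.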

VMAX

In terms of class imbalance, there are 540 RI instances compared to 10068 nonRI instances, so RI makes up roughly 5% of the labeled instances. The VMAX_Diff shown below ranges from NA to NA.

Time Component

The longest storm in the data is AL142012 lasting a total of 558 hours. The shortest storm is AL052010 lasting a total of 6 hours. The distribution of storm duration for all storms, colored by the number of RI events in each storm, looks like:

In terms of the time series view, a snapshot look at VMAX vs TIME:

Similarly, the 24 hour change in maximum wind speed varies widely for the storms that contain RI events.

Dissipating Storms

Looking specifically at storms where data are missing because of dissipation, there are exactly 25 dissipating storms out of 460 total storms. The dissipating storms are plotted below.

Missing Data

To summarize the amount of missing data in each of the attributes, we first subset to the time points with and without GOES satellite information available. As of March 2009, the satellite data were backfilled to 1983, whereas they were previously available only from 1995. A summary of the 5 storms up to 1983:

And a summary of the 306 storms post 1983:

To drill down specifically, the largest amount of missing data comes from the satellites, with 150 of the 306 post-1983 storms missing IR information.
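A per-attribute count of missing values, of the kind summarized above, could be computed along the following lines. This Python sketch assumes the rows are dictionaries in which a missing value is stored as None; the function name missing_summary and the demo values are hypothetical.

```python
def missing_summary(rows):
    """Return {attribute: (n_missing, fraction_missing)}, most missing first."""
    total = len(rows)
    counts = {}
    for row in rows:
        for name, value in row.items():
            counts[name] = counts.get(name, 0) + (value is None)
    ordered = sorted(counts.items(), key=lambda item: -item[1])
    return {name: (n, n / total) for name, n in ordered}


# Made-up rows: RHCN is missing twice, VMAX once.
demo_rows = [
    {"VMAX": 30, "RHCN": None},
    {"VMAX": 35, "RHCN": None},
    {"VMAX": None, "RHCN": 12},
]
summary = missing_summary(demo_rows)
for name, (n_missing, fraction) in summary.items():
    print(f"{name}: {n_missing} missing ({fraction:.0%})")
```

Applying the same function separately to the subsets with and without satellite information would give one summary per subset.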

References

DeMaria, Mark, and John Kaplan. 1994. “A Statistical Intensity Prediction Scheme (SHIPS) for the Atlantic Basin.” Weather and Forecasting 9 (June): 209–20.

Jankulak, Michael L. 2012. “Prediction of Rapid Intensity Changes in Tropical Cyclones Using Associative Classification.” PhD thesis, University of Miami.

Kaplan, John, and Mark DeMaria. 2003. “Large-Scale Characteristics of Rapidly Intensifying Tropical Cyclones in the North Atlantic Basin.” Weather and Forecasting 18 (6): 1093–1108.

“NHC Track and Intensity Models.” 2009.